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ABSTRACT 

Various 'omics' technologies, including microarrays 
and gas chromatography mass spectrometry, can 
be used to identify hundreds of interesting genes, 
proteins and metabolites, such as differential 
genes, proteins and metabolites associated with 
diseases. Identifying metabolic pathways has 
become an invaluable aid to understanding the 
genes and metabolites associated with studying 
conditions. However, the classical methods used 
to identify pathways fail to accurately consider 
joint power of interesting gene/metabolite and the 
key regions impacted by them within metabolic 
pathways. In this study, we propose a powerful ana- 
lytical method referred to as Subpathway-GM for 
the identification of metabolic subpathways. This 
provides a more accurate level of pathway analysis 
by integrating information from genes and metabol- 
ites, and their positions and cascade regions within 
the given pathway. We analyzed two colorectal 
cancer and one metastatic prostate cancer data 
sets and demonstrated that Subpathway-GM was 
able to identify disease-relevant subpathways 
whose corresponding entire pathways might be 
ignored using classical entire pathway identification 
methods. Further analysis indicated that the power 
of a joint genes/metabolites and subpathway 
strategy based on their topologies may play 



a key role in reliably recalling disease-relevant 
subpathways and finding novel subpathways. 

INTRODUCTION 

Various 'omics' technologies, such as microarrays, RNA- 
seq and gas chromatography mass spectrometry 
(GC-MS), can be used to identify potentially interesting 
(e.g. differential) genes and metabolites, including those 
associated with specific diseases. One of the challenges is 
to use such information to provide a better understanding 
of the underlying biological phenomena. Metabolic 
pathway analysis has become an invaluable aid to under- 
standing the molecules generated by these 'omics' 
technologies. Information on the metabolic pathways 
investigated is available via pathway databases, such as 
KEGG, which manually curates electronic high-quality 
pathway structure information on the enzymes and me- 
tabolites involved in the metabolic processes (1). One of 
the most widely applied pathway analysis methods is the 
overrepresentation approach (ORA), which compares the 
number of interesting genes (metabolites) that hit a given 
pathway with the number of genes (metabolites) expected 
to hit the given pathway by chance. If the observed 
number is significantly different from that expected by 
chance, the pathway is reported as significant. A statistical 
model, such as the hypergeometric test, can be used to 
calculate the enrichment significance (P- values). 

Many methods, including ORA and gene set enrich- 
ment analysis (GSEA), have been developed to identify 
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pathways [mainly reviewed in (2-7)]. However, these 
existing methods share obvious limitations in terms of 
their abilities to identify pathways, especially metabolic 
pathways (2,8,9). Because metabolic pathways involve 
both genes and metabolites, the dysfunction of which 
play important roles in diseases (10-12), integrative 
pathway analysis of both genes and metabolites will thus 
help to interpret the underlying biological phenomena. 
However, most existing methods ignore this integrative 
approach and were designed and developed mainly for 
the analysis of interesting genes (2^1), leading to a 
relative lack of information on metabolites in the 
pathways. 

Pathway identification methods based on metabolite 
lists, such as MBRole (5), MetPA (6) and MSEA (7), 
have recently been developed. However, these methods 
use only metabolite information, and thus also ignore 
the power of joint gene/metabolite analysis. Kamburov 
et al. recently developed a new metabolic pathway 
analysis method, IMPaLA, for the identification of bio- 
chemical pathways related to tumor-cell chemosensitivity. 
This method integrates enrichment significance (P-values) 
of pathways calculated through ORA based on interesting 
genes and interesting metabolites to improve pathway 
identification (10). Kamburov et al. (10) assumed inde- 
pendence of pathways associations from data sets of 
genes and metabolites, and the joint statistical significance 
was thus calculated by multiplying the P-values 
for genes and metabolites, respectively (,P-value gene x 

P-value metab oiite)- 

Although IMPaLA can effectively improve pathway 
identification, simply changing the statistical model 
might not fully reflect the superiority of a joint approach 
to analyzing genes and metabolites of interest within meta- 
bolic pathways. From a biological perspective, dysfunc- 
tional genes are closely related to dysfunctional 
metabolites in pathways, and biological features, such as 
their topology, might thus be more useful in helping to 
identify pathways than simply changing the statistical 
analysis (see examples in the 'Results' section). 
Improvements in pathway identification methods may 
need to be made at a fundamental biological level, 
rather than a statistical level (2). Moreover, fewer differ- 
ential metabolites are detected by metabolomic studies 
compared with differentially expressed genes. However, 
metabolites may be located at important positions in 
pathways, such as arachidonic acid in the arachidonic 
acid metabolism pathway. The importance of metabolites 
in pathway structure is ignored by most of the existing 
pathway identification methods. More importantly, meta- 
bolic pathways stored in pathway databases are often too 
large to allow the accurate interpretation of the relevant 
biological phenomena. Key subpathway regions represen- 
tative of the entire corresponding pathway may be more 
useful in terms of interpreting the relevant biological phe- 
nomena. Moreover, several studies have shown that 
abnormalities in subpathway regions of metabolic 
pathways contribute to the etiology of diseases (9,13,14), 
although most of the existing methods only identify entire 
pathways, rather than the key subpathway regions. We 
previously attempted to identify key metabolic 



subpathways using only gene information (13). However, 
the joint use of genes and metabolites has still been largely 
ignored, and the positional importance of genes and me- 
tabolites should be considered to locate key regions of 
pathways related to the underlying biological phenomena, 
such as specific diseases. 

In this study, we developed a powerful analytical method 
called Subpathway-GM for the identification of biologic- 
ally meaningful metabolic subpathways. Subpathway-GM 
integrates 'interesting genes 1 and 'interesting metabolites 1 
related to the study condition (e.g. disease) into the corres- 
ponding enzyme and metabolite nodes (referred to as sig- 
nature nodes) within the metabolic pathway. We then 
analyzed lenient distance similarities of signature nodes 
within the pathway structure to locate key metabolic 
cascade subpathway regions (see 'Materials and Methods 1 
section). Finally, a hypergeometric test was used to evaluate 
the enrichment significance of these subpathway regions. 
We applied Subpathway-GM to two colorectal cancer 
and one metastatic prostate cancer data sets and demo- 
nstrated that this method was able to identify disease- 
related subpathway regions successfully and reliably. 



MATERIALS AND METHODS 

Data sets 

We analyzed two colorectal cancer data sets and one 
metastatic prostate cancer data set. We obtained differen- 
tial genes from gene expression data and differential me- 
tabolites from metabolomic experimental studies. 

Colorectal cancer data set 1 

The gene expression profile data included 32 pairs of colo- 
rectal adenoma and adjacent normal mucosa tissues, ori- 
ginally analyzed by Sabates-Bellver et al. (15). The data 
are publicly available at the GEO database (GEO acces- 
sion number = GSE8671). We used both the significance 
analysis of microarray (SAM) method (16) and the 
fold-change (FC) method to identify differentially ex- 
pressed genes. A gene was considered to be differentially 
expressed when it was significant in the SAM method at a 
significance level of 0.001 (False discovery rate 
FDR < 0.001) and the log2| fold-change | value of the 
gene was also >1 (i.e. FC>2 or FC<0.5). A total of 
2053 differentially expressed genes were identified by the 
aforementioned strategies. Differential metabolites were 
directly obtained from the results of several experimental 
studies, including Chan et al. (17), Qiu et al. (18,19) and 
Denkert et al. (20). The metabolites were extracted from 
these articles and converted to KEGG compound IDs. 
Finally, 90 unique differential metabolites associated 
with colorectal cancer were obtained. 

Colorectal cancer data set 2 

The gene expression profile data, including 70 colorectal 
cancer samples and 12 normal samples, was initially 
analyzed by Hong et al. (21) (GEO accession 
number = GSE9348). We identified 1452 differentially 
expressed genes using the same strategy as for colorectal 
cancer data set 1 (FDR < 0.001 and FC > 2 or FC < 0.5). 
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The differential metabolites used were the same as in colo- 
rectal cancer data set 1 . 

Metastatic prostate cancer data set 

Gene expression profile data, including six benign prostate 
samples, seven clinically localized prostate cancer samples 
and six metastatic prostate cancer samples, were initially 
analyzed by Varambally et al. (22) (GEO accession 
number = GSE3325). The localized and metastatic 
prostate cancer samples were used to identify differentially 
expressed genes associated with metastatic prostate 
cancer, using the SAM method (FDR < 0.01) and the 
FC method (FC > 2 or FC < 0.5) simultaneously. A total 
of 1773 differential genes were identified. Multiple 
metabolomic profiles (liquid chromatography/GC-MS), 
including 16 benign adjacent prostate samples, 12 clinic- 
ally localized prostate cancer samples and 14 metastatic 
prostate cancer samples, were obtained from studies of 
Sreekumar et al. (12). The localized and metastatic 
samples were used to identify differentially expressed 
genes. We used the method (Wilcoxon rank-sum test) 
described by Sreekumar et al. to identify differential me- 
tabolites, and then converted these metabolite names to 
KEGG compound IDs. Finally, 53 metabolites associated 
with metastatic prostate cancer were identified (P<0.01; 
Wilcoxon rank-sum test). 

Methods 

Subpathway-GM has been implemented as a freely avail- 
able web-based and R-based tool (http://bioinfo.hrbmu 
.edu.cn/SubpathwayGM). Figure 1 depicts the schematic 
overview of Subpathway-GM. The step-by-step method is 
provided in Supplementary Text. The users inputs inter- 
esting the genes and metabolites of interest, and metabolic 
subpathways can then be identified mainly through: 
(i) mapping genes and metabolites of interest to graphs 
of pathways after graph-based reconstruction of metabolic 
pathways; (ii) locating subpathways within pathways 
according to signature nodes; (iii) evaluating the statistical 
significance of subpathways. The method is described in 
more detail later in the text. 

Map genes and metabolites of interest to graphs 
of pathways 

We converted all 150 metabolic pathways in KEGG (17 
December 2010) to the directed graphs based on biochem- 
ical reaction information in KGML files (for an XML 
representation of KEGG pathway information, http:// 
www.kegg.jp/kegg/xml/). For each pathway, we extracted 
all metabolites and enzymes within the pathway as nodes 
in the corresponding graph. If a metabolite participated in 
a reaction as a substrate or product, a directed edge was 
used to connect the corresponding metabolite nodes to the 
enzyme (i.e. reaction) nodes. Substrates were directed 
towards the enzyme nodes, whereas enzyme nodes were 
directed towards their products. Reversible reactions had 
twice as many edges as irreversible reactions. This strategy 
for converting pathway graphs had the advantage of de- 
veloping graph algorithms. More importantly, positional 
information for metabolites and genes encoding enzymes 
could be extracted efficiently and used via a graph model 



that keeps pathway structure. Interesting genes and me- 
tabolites can be then mapped to the corresponding nodes 
within the pathway graphs. Notably, interesting metabol- 
ites can be mapped directly to metabolite (substrate and 
product) nodes. Interesting genes can be assigned to 
Enzyme Commission (EC) (or KEGG Ontology (KO)) 
numbers and matched to enzyme nodes. KO numbers 
are recommended by KEGG for mapping genes to path- 
ways; therefore, we used KO numbers as the default 
numbers in Subpathway-GM. Finally, the mapped nodes 
within each pathway graph were defined as signature 
nodes. 

Locate subpathways within pathways according to 
signature nodes 

Signature nodes within pathways represent information 
on genes and metabolites of interest. These nodes can 
effectively help to locate subpathways associated with 
the genes and metabolites of interest, through further con- 
sidering their topologies within pathways. Furthermore, 
distances between some nodes in a subpathway are 
usually similar. We, therefore, used 'lenient' distance simi- 
larity of signature nodes to locate subpathways. 
Specifically, for each pathway that contained signature 
nodes, we computed the shortest path between any two 
signature nodes in the given pathway graph. If the shortest 
path between two signature nodes was shorter than n + 1 , 
then the two signature nodes and other non-signature 
nodes at their shortest path were added to the same 
node set. The parameter n indicates the maximum 
permitted non-signature node number at the shortest 
path between signature nodes. We then extracted the cor- 
responding subgraph in the pathway graph according to 
each node set, and finally defined these subgraphs with 
node number >s as the subpathway regions because 
subgraphs with small scales can not usually form biologic- 
ally relevant subpathway. Figure 1 shows the subpathways 
within entire pathways identified by Subpathway-GM 
with n = 2 and s = 5. 

Our subpathway strategy was based on lenient distance 
similarity. When a node within a pathway appeared within 
the permitted shortest path between any two signature 
nodes, the node was merged into the corresponding 
subpathway. If a node within the entire pathway had 
high topological centrality (e.g. high degree and/or 
betweenness), then that node would be more likely to 
occur within the permitted shortest path, and thus 
appear in the final subpathway. In metabolic pathways, 
nodes in the key regions usually have high topological 
centrality (e.g. the region near arachidonic acid in arachi- 
donic acid metabolism). These key nodes, therefore, tend 
to appear in subpathways even though they are not signa- 
ture nodes. Subpathway-GM will thus tend to identify key 
regions in entire pathways that might be representative of 
the corresponding entire pathways. 

Flexibility can be introduced to this subpathway 
strategy by varying the parameter n. A smaller value of 
n means that only those nodes meeting stricter distance 
similarities will be added to the corresponding subpath- 
way, and the identified subpathways thus become smaller 
compared with larger values of n. The number of 
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Figure 1. Schematic overview of Subpathway-GM. 



non-signature nodes within subpathways will also tend to 
reduce because of the smaller number of permitted 
non-signature nodes, and the ratio of signature nodes in 
the located subpathway regions will thus increase as n 



decreases. This can help users to identify subpathways 
closely associated with genes and metabolites of interest. 
In contrast, increasing n will usually increase the ratio of 
non-signature nodes and the size of the subpathways. Key 
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nodes within entire pathways will also tend to be added to 
the subpathway because of the high topological centrality 
of key nodes. Subpathway-GM will thus tend to identify 
the key regions in entire pathways that might be represen- 
tative of the corresponding entire pathways. 

Evaluate the statistical significance of subpathways 

We used the hypergeometric test to calculate the statistical 
significance of each subpathway. To achieve this, the fol- 
lowing values needed to be calculated: (i) the number of 
interesting genes and metabolites submitted for analysis; 
(ii) the number of background genes and metabolites; (hi) 
the number of background genes and metabolites 
annotated to each subpathway; and (iv) the number of 
interesting genes and metabolites annotated to each 
subpathway. The default background is usually con- 
sidered to be the whole genome and metabolome of the 
given organism (2). In this study, we, therefore, used all 
human genes in KEGG as the background genes, and all 
metabolites in the Human Metabolome Database 
(HMDB) (23) and KEGG Human Pathway (1) as back- 
ground metabolites because human metabolites are not 
included in KEGG. 

The following equation represents an example calcula- 
tion of the statistical significance of a subpathway. If the 
whole genome (metabolome) has a total of m g genes 
(m m metabolites), of which t g (t m ) are involved in the 
subpathway under investigation, and the set of genes (me- 
tabolites) submitted for analysis has a total of n g genes 
(«,„ metabolites), of which r g (/",„) are involved in the 
same subpathway, then the /"-value can be calculated to 
evaluate the enrichment significance of the subpathway as 
follows: 



1 - 



£ 

x=0 



t g +t„ 



m g +m„ 



n g +n,„ — x 



m g +m n 
n g +n m 



RESULTS 

We analyzed two colorectal cancer and one metastatic 
prostate cancer data sets using Subpathway-GM with 
the parameters n = 5 and s = 5. This study focused on 
identifying disease-related subpathways. We, therefore, 
set the parameter s = 5 because this type of subpathways 
(s > 5) has been reportedly associated with disease in some 
studies and is considered by many groups to represent a 
pathway. To set an appropriate n value for the identifica- 
tion of disease-related subpathways, we examined the dis- 
tances among known disease nodes within pathways based 
on disease genes and metabolites in the Genetic 
Association Database (GAD) (24) and HMDB (23) 
(Figure 2A). The average shortest distance among these 
nodes was 8.02, which was significantly smaller than that 
between all nodes within metabolic pathways 
(P<2.2E-16; Wilcoxon rank-sum test). We further 
computed the shortest distance between each disease 
node and its nearest disease node and found that the 
distance was <5 for 85% disease nodes (Figure 2B). 



Some studies have suggested that genes associated with 
the same disease show close tendencies in biological 
pathways, and that their biological functions tend to be 
similar (25-27). A value of n = 5 thus seems to represent 
the closeness of genes and metabolites in diseases. 

To compare Subpathway-GM with other methods at 
the system level, we used three specific ORAs to identify 
pathways for each data set: Pathway-G, Pathway-M and 
IMPaLA. These ORA methods are commonly used for 
pathway identification and have been included in many 
pathway analysis tools (2,6,7,10). Pathway-G uses only 
genes of interest to identify entire pathways via 
hypergeometric tests (2). Similar to Pathway-G, 
Pathway-M uses metabolites of interest (5-7), whereas 
IMPaLA integrates genes and metabolites of interest to 
identify entire pathways via hypergeometric tests but does 
not consider pathway structure and subpathway identifi- 
cation strategy (10). 

Colorectal cancer 1 

Our first example was chosen to illustrate the effectiveness 
of Subpathway-GM for identifying subpathways 
associated with colorectal cancer. Subpathway-GM 
located 56 potential subpathways from all metabolic 
pathways using 2053 differential genes and 90 differential 
metabolites. With a strict cut-off value of P<0.01, 
Subpathway-GM finally identified 26 significant metabolic 
subpathways (Figure 2C). When many subpathways are 
considered, a high-false-positive discovery rate is likely to 
result; therefore, we calculated FDR corrected P-values 
for these subpathways using the Benjamini-Hochberg 
FDR method (Supplementary Table SI). The results 
showed that 26 subpathways with P<0.01 remained sig- 
nificant at the strict cut-off of FDR < 0.02, suggesting a 
low-false discovery rate. We, therefore, continued to use 
the non-corrected P-values in the following analyses. 

The 26 significant metabolic subpathways identified cor- 
responded to 25 entire pathways (Figure 2C). Of these, up 
to 20 (80%) were well reported to be associated with 
cancer (indicated by black border in Figure 2D). 
Detailed evidences were provided in Supplementary 
Table SI. Several other pathways also interact with the 
reported cancer pathways, suggesting a possible associ- 
ation (Figure 2D and Supplementary Text). Many 
pathways identified by Subpathway-GM were undetected 
by Pathway-G, -M or IMPaLA. Pathway-G found only 
three significant pathways at the 1 % significance level (see 
Supplementary Data set SI), all of which were included in 
the 25 significant pathways found by Subpathway-GM 
(Figure 2C). However, Subpathway-GM identified an 
additional 22 (88%) not identified by Pathway-G (see 
Supplementary Data set SI). The powers of Pathway-M 
and IMPaLA also seemed to be limited, and 68 and 40% 
of the 25 significant pathways identified by Subpathway- 
GM were not considered as significant by Pathway-M 
and IMPaLA, respectively (see Supplementary Data 
set SI). Surprisingly, up to 10 pathways identified by 
Subpathway-GM were simultaneously ignored by 
Pathway-G, -M and IMPaLA (indicated by black label 
in Figure 2C). 
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Figure 2. Identification of metabolic subpathways associated with colorectal cancer. (A) Distances among known disease nodes within metabolic 
pathways. (B) Empirical cumulative distribution functions of shortest path lengths between each disease node and its nearest disease node within 
pathways. (C) Plots of pathway significance (-loglO P-value) in Subpathway-GM, Pathway-G, Pathway-M and IMPaLA. Subpathway-GM identified 
26 significant metabolic subpathways, corresponding to 25 entire pathways. Plus sign indicates that the pathway was identified by the corresponding 
method at the 1% significance level. Bold labels represent the additional pathways identified by Subpathway-GM. (D) Interaction network of the 
subpathway identified by Subpathway-GM. Two subpathways are connected by an edge if they share a non-empty intersection of metabolites or 
genes. Edge width between subpathways is proportional to the number of genes and metabolites shared by the two connected subpathways. Node 
size is proportional to the degree of the node. Node color reflects statistical significance of pathway (P-value). Subpathways well supported by 
existing literature are shown with a black border node. 



We focused on seven of these 10 additional pathways 
that contained both differential genes and metabolites 
(Table 1). These pathways had high P- values in 
Pathway-G, -M and IMPaLA, and we further tested 
their ranks in these pathways. The result showed that 
almost all pathways were ranked >25 in Pathway-G, -M 
and IMPaLA (Table 1), indicating that they might be 
ignored by Pathway-G, -M and IMPaLA from a 'rank' 
point of view. 

The most significant of the seven additional 
subpathways (path:00380_3) belonged to tryptophan me- 
tabolism pathway (Figure 3). Subpathway-GM yielded a 
P-value of 0.00037 (FDR corrected to 0.0029), but the 
tryptophan metabolism pathway was not considered as 
significant in Pathway-G, -M (i 5 >0.05) or IMPaLA 



(.P>0.01). Abnormality of the key tryptophan-serotonin 
regions (top region in Figure 3) has been reported to cause 
tumor cell proliferation in colon and prostate cancers 
[reviewed in (28)]. Moreover, the key region where tryp- 
tophan is converted by indoleamine-2,3-dioxygenase to 
kynurenime (red arrow region in Figure 3) was closely 
related to immune activation, cell proliferation and 
impaired quality of life in colorectal cancer [reviewed in 
(29)]. Recent studies also showed that the tryptophan-2,3- 
dioxygenase (TDO)-kynurenime region effectively sup- 
pressed antitumor immune responses and promoted 
tumor cell survival and motility (11). The differential me- 
tabolites and enzymes in tryptophan metabolism pathway 
encoded by differential expressed genes were located in 
the local cascade region and formed signature nodes in 



Page 7 of 14 Nucleic Acids Research, 2013, Vol. 41, No. 9 elOl 



Table 1. Seven additional subpathways identified by Subpathway-GM using colorectal cancer data set 1 



Subpathway ID 


Pathway name 


S-P(R) 


S-FDR 


I-P(R) 


G-P(R) 


M-/>(R) 


Representative 


Reference 


path:00380 3 


Tryptophan metabolism 


0.00037(7) 


0.0041 


0.037(42) 


0.095(24) 


0.38(48) 


YES 


(11,28,29,30,31) 


path:00010 1 


Glycolysis/gluconeogenesis 


0.0017(14) 


0.0070 


0.020(35) 


0.39(42) 


0.052(31) 




(17,18,32-34) 


path:00562 1 


Inositol phosphate metabolism 


0.0021(15) 


0.0081 


0.10(56) 


0.16(30) 


0.66(55) 




(32,33,35-37) 


path:00340 1 


Histidine metabolism 


0.0039(18) 


0.011 


0.017(33) 


0.049(14) 


0.34(46) 


YES 


(38,39) 


path:00590 1 


Arachidonic acid metabolism 


0.0040(19) 


0.011 


0.034(39) 


0.038(12) 


0.88(59) 


YES 


(40^2) 


path:00500 1 


Starch and sucrose metabolism 


0.0048(22) 


0.011 


0.049(46) 


0.12(26) 


0.40(49) 


YES 




path:00270 2 


Cysteine and methionine metabolism 


0.0051(24) 


0.011 


0.012(29) 


0.67(59) 


0.018(21) 


YES 


(43,44) 



S-P(R): P-values {P) and ranks (R) of pathways in Subpathway-GM; I-P(R), G-P(R), M-P(R): P-values (P) and ranks (R) for IMPaLA, Pathway-G 
and Pathway-M respectively; S-FDR: FDR corrected P-values of pathways in Subpathway-GM. 
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Figure 3. Tryptophan metabolism pathway where the differential genes and metabolites of colorectal cancer were annotated. Nodes near asterisk 
symbol belong to the key subpathway region (path:00380_3) identified by Subpathway-GM. Enzymes (rectangular nodes) mapped by differential 
genes are shown with red node labels and borders. Metabolites (circle nodes) mapped by differential metabolites were showed with red node borders. 



Subpathway-GM. Notably, differential tryptophan was at 
the center of the pathway. Subpathway-GM used distance 
similarity information between these signature nodes to 
identify the pathway, as well as the key subpathway 
region, effectively. 



The second subpathway (path:00010_l) belonged to the 
glycolysis/gluconeogenesis pathway, which is highly 
associated with cancer cell metabolism. During the early 
part of the 20th century, Warburg found that cancer cells 
consume glucose and acidify their environment with 
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lactate (Warburg effect) (32,33). The differential genes and 
metabolites on the colorectal cancer data set were mostly 
located in the pyruvate and lactate metabolism region of 
the pathway, which was successfully identified by Sub- 
pathway-GM (Supplementary Figure SI). Many studies 
also indicated that this region was closely related to 
energy demand of colorectal cancer tissues (17,18,34). 

The third subpathway (path:00562_l) belonged to the 
inositol phosphate metabolism pathway. Subpathway- 
GM analysis yielded a P- value of 0.0021 (FDR corrected 
to 0.0081) for this subpathway, but the corresponding 
pathway was ignored by Pathway-G, -M and IMPaLA 
even at the 10% significance level (P = 0.16, 0.66 and 
0.1, respectively). The subpathway path:00562_l (Supple- 
mentary Figure S2), especially its starting region, which 
contains phosphoinositide 3-kinase and the tumor sup- 
pressor Phosphatase and tensin homolog (PTEN), was 
reported to be highly associated with tumor growth and 
survival (35,36). The end product of the subpathway, 
inositol, was differential in colorectal cancer and located 
in the center of the pathway. Inositol has been highly 
associated with the initiation of cancer and the control 
of cancer metastases (32,33,37). 

The fourth subpathway (path:00340_l) belonged to his- 
tidine metabolism, whereby histidine is converted by his- 
tidine decarboxylase to histamine, which is highly 
associated with cancer [reviewed in (38)]. Cianchi et al. 
(39) demonstrated that dual inhibition of the histidine de- 
carboxylase and cyclooxygenase (COX) subpathways in 
arachidonic acid metabolism might be as a possible thera- 
peutic tool for the treatment of colorectal cancer. 

The fifth subpathway (path:00590_l) belonged to ara- 
chidonic acid metabolism with Subpathway-GM P-value 
of 0.004 (FDR corrected to 0.011). Pathway-M, however, 
only gave a P-value of 0.88 because only one metabolite 
(arachidonic acid) was annotated to the pathway. The ara- 
chidonic acid pathway was composed of three key meta- 
bolic regions: COX, Lysyl oxidase (LOX) and Human 
Cytochrome P450 (CYP) (Figure 4), each of which has 
been implicated in several types of cancers, including colo- 
rectal and prostate cancers [reviewed in (40,41); see 
Supplementary Text]. Arachidonic acid is located in the 
central position of three key metabolic regions of the 
pathway. Subpathway-GM thus used the close positional 
relationship between differential arachidonic acid and 
nearby differential genes to identify one irregular 
subpathway (path:00590_l) that contained the certain 
part of the three regions (the * region in Figure 4), espe- 
cially, the Arachidonate lipoxygenases (ALOX5) region. 

Subpathway-GM tends to locate key regions represen- 
tative of the corresponding entire pathways. We further 
tested whether the subpathways identified as significant by 
Subpathway-GM tended to be representative of the cor- 
responding entire pathways in the colorectal cancer data 
set. We obtained all the nodes within the 26 significant 
subpathways and calculated their degrees and betweenness 
within the corresponding entire pathways. Degree and 
betweenness are the two most popular topological central- 
ity indexes. The degree and betweenness of nodes within 
subpathways were significantly higher than that for all 
nodes within entire pathways (Table 2). For example, 



the average degree of nodes within the significant 
subpathways was 4.00, which was significantly higher 
than the average degree of all nodes, which was 2.14 
(P = 4.05E-94; Wilcoxon rank-sum test). We also found 
that both genes and metabolites within significant 
subpathways tended to exhibit high topological centrality 
(Table 2). These results suggest that nodes within signifi- 
cant subpathways may be important in pathways in terms 
of both local (degree) and global (betweenness) topologies, 
and they indicate that the subpathways identified by 
Subpathway-GM tend to be representative of the entire 
pathways. 

We also examined these significant subpathways 
identified by Subpathway-GM from a biological point of 
view. We defined a subpathway as representative when it 
covered most core parts of the corresponding entire 
pathway, based on our biological knowledge. Nineteen 
(~73.07%) of the 26 subpathways showed high tendencies 
to be representative of the corresponding entire pathways 
(Supplementary Table SI). For example, the aforemen- 
tioned subpathways were representative of the correspond- 
ing tryptophan, arachidonic acid and histidine metabolism 
pathways, respectively (Table 1). Tryptophan was at the 
center of the tryptophan pathway, which contains three 
main regions: tryptophan-kynurenime, tryptophan-sero- 
tonin and tryptophan-tryptamine. Subpathway-GM effect- 
ively identified the core parts of these regions as 
subpathway path:00380_3 (Figure 3). The arachidonic 
acid pathway was composed of three key metabolic 
regions: COX, LOX and CYP, with arachidonic acid at 
the center of three regions. Subpathway-GM identified 
subpathway path:00590_l, including arachidonic acid, 
COX, LOX and CYP. Taken together, Subpathway-GM 
can thus not only identify cancer-related subpathways but 
also tends to locate key regions representative of entire 
pathways. 

Seven subpathways were not defined as representative 
of their corresponding entire pathways, but they were still 
reported to be highly associated with cancer 
(Supplementary Table SI). These included glycolysis/ 
gluconeogenesis and inositol phosphate metabolism, 
which displayed obvious local subpathway features (e.g. 
Supplementary Figures SI and S2). For example, 
subpathway path:00562_l belonged to the inositol phos- 
phate metabolism pathway. Inositol is at the center of this 
pathway, which included many paths to consume and 
release the metabolite. Because the subpathway only 
covered one of these paths, and does not cover most 
regions of the pathway, we did not consider this 
subpathway to be representative of the corresponding 
entire pathway. The subpathway path:00562_l (Supple- 
mentary Figure S2), however, especially the starting 
region of the subpathway, was reported to be highly 
associated with tumor growth and survival (35,36). The 
subpathway path:00010_l comprised the pyruvate and 
lactate metabolism region, which is located in the 
bottom region of the glycolysis/gluconeogenesis pathway 
(Supplementary Figure SI). Pyruvate is at the center of the 
pathway and showed a high degree and betweenness in the 
pathway. Lactate has been widely reported to be 
associated with cancer (Warburg effect) (32,33). The 
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Figure 4. Arachidonic acid metabolism pathway where the differential genes and metabolites of colorectal cancer were annotated. Nodes near 
asterisk symbol belong to the key subpathway (path:00590_l) region identified by Subpathway-GM. The region contained the certain parts of 
three subsystems: COX, LOX and CYP450. Most of the differential genes involved in LOX5 belonged to the LOX subsystem. 



Table 2. Degree and betweenness of nodes within significant subpathways and corresponding entire pathways 



Centrality 


Molecules 


Subpathway 


Entire pathway 


.P-values 


Degree 


Genes and metabolites 


4.00 


2.14 


4.05E-94 




Genes 


3.95 




5.96E-115 




Metabolites 


4.18 




5.05E-61 


Betweenness 


Genes and metabolites 


617.59 


212.18 


6.36E-113 




Genes 


614.29 




5.34E-82 




Metabolites 


658.47 




1.02E-46 



differential genes and metabolites involved in the afore- 
mentioned pathways were located in the local cascade 
region of pathways. Pathway-G, -M and IMPaLA 
tended to miss this kind of pathways because of the low 
ratios of the differential genes and metabolites involved, 
resulting in low enrichment significance. However, 
Subpathway-GM can effectively detect the biologically 
meaningful pathways by accurately identifying key 
subpathway regions through the joint power of differen- 
tial genes and metabolites, and their topologies within 
pathways. 

Colorectal cancer 2 

Our second data set was chosen to illustrate the reliability 
of the metabolic pathways identified by Subpathway-GM 
analysis in colorectal cancer data set 1 by analysis of gene 
expression in a second colorectal cancer data set 2. In this 



data set, we identified 1773 differentially expressed genes, 
including 792 (38.59%) of the genes identified in data set 
1 . Subpathway-GM analysis of the data set 2 detected 64 
potential subpathways from all metabolic pathways, 36 of 
which, corresponding to 33 entire pathways, were finally 
identified as significant at the 1% significance level. 
Twenty (80%) of 25 significant pathways found in the 
data set 1 were also in data set 2, although only 38.59% 
differential genes in data set 1 were also in data set 2. The 
overlap between pathways found in the data sets 1 and 2 
was also highly statistically significant (P < 1.00E-11; 
hypergeometric test). 

Most of the cancer-related pathways in data set 1, such 
as tryptophan metabolism, glycolysis/gluconeogenesis and 
arachidonic acid metabolism, were also identified in data 
set 2, indicating that Subpathway-GM can reliably identify 
disease-related metabolic pathways. Furthermore, we 
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compared the results of the two data sets 1 and 2 at the 
subpathway level. We used the Jaccard index to compute 
similarities between molecules (genes and metabolites) in 
the corresponding subpathways for 20 common pathways 
found in the data sets 1 and 2. The similarity values of 
17 pathways were >50%, and the average value for all 
20 pathways was 66%. These results suggest that the 
results of Subpathway-GM analysis were reliable at 
the subpathway level. Moreover, eight pathways in data 
set 2, including tryptophan metabolism, glycolysis/ 
gluconeogenesis and arachidonic acid metabolism, were 
identified by Subpathway-GM, but not by Pathway-G, 
-M or IMPaLA. 

Metastatic prostate cancer 

To further demonstrate the use of Subpathway-GM, we 
applied this analysis to a metastatic prostate cancer data 
set (12,22). Because the metabolic pathways associated 
with metastatic prostate cancer have not been widely 
reported, we expected to identify novel pathways in this 
data set. Subpathway-GM identified 16 significant 
subpathways at the 1% significance level, nine (56.25%) 
of which were involved in amino acid metabolism. Some 
studies suggested that amino acid metabolism might be 
highly associated with metastatic cancer (12,43), for 
example, 'glycine, serine and threonine metabolism' was 
reported to be highly associated with metastatic prostate 
cancer (12). Subpathway-GM yielded a P-value of 
0.000027 (FDR corrected to 0.00042) and identified the 
metastasis-related key subpathway path:00260_l 
(Supplementary Figure S3) (12). 

Of 16 pathways identified by Subpathway-GM, up to 
nine (56.25%) were ignored by Pathway-G, -M and 
IMPaLA simultaneously (Supplementary Table S2). The 
most significant additional subpathway (path:00380_l) 
contained the key tryptophan-serotonin regions in the 
tryptophan metabolism pathway (Supplementary Figure 
S4). Dizeyi et al. (30,31) demonstrated that serotonin 
can activate mitogen-activated protein kinase and PI3K/ 
Akt signaling pathways, which play an important role in 
prostate cancer progression, especially in androgen- 
independent state disease. In the third additional 
pathway 'cysteine and methionine metabolism', 
Subpathway-GM identified the metastasis-related key 
subpathway region (path:00270_3) involving methionine 
metabolites (Supplementary Figure S5). In the pathway, 
methionine metabolites, including cystathionine and 
cysteine, can significantly increase the ability to predict 
aggressive prostate cancer (43). Furthermore, methionine 
metabolism involves mechanisms for sarcosine formation 
(43), and Sreekumar et al. (12) identified sarcosine as a 
potential key metabolic intermediary of prostate cancer 
cell invasion and aggressivity. In addition, the arachidonic 
acid metabolism pathway, especially COX regions in the 
pathway, has been highly associated with prostate cancer 
progression [reviewed in (45)]. 

To the best of our knowledge, some of the additional 
pathways identified by Subpathway-GM, such as histidine 
metabolism, have not been reported in association with 
metastatic prostate cancer. As shown in Figure 5A, the 



histamine region in the pathway was accurately identified 
by Subpathway-GM and the subpathway yielded a 
P-value of 0.0016 (FDR corrected to 0.0094). Histamine 
was located in the central region in the subpathway, sug- 
gesting a potential high association with metastatic 
prostate cancer. We further explore this using a transwell 
chamber assay to detect the effect of histamine on cell 
migratory ability in vitro (see Supplementary Text). 
Briefly, the prostate cancer cell line DH145 was treated 
with different final concentrations of histamine 
(l-6umol/l) for 24 h. The result showed that low concen- 
tration histamine could promote prostate cancer cell mi- 
gratory ability and had a dose-dependent effect (Figure 5B 
and D). In contrast, high concentration histamine in- 
hibited the cell migration. To exclude the effect of hista- 
mine on cell viability, viability was determined by 3-(4,5- 
Dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide 
(MTT) assay after treatment of the cells with histamine 
(see Supplementary Text). The result showed that hista- 
mine had no effect on cell viability (Supplementary Figure 
S6). Cells were treated with 3 umol/1 of histamine for dif- 
ferent period (0-24 h) to detect any time-dependent effects. 
The results confirmed that histamine promoted prostate 
cancer cell's ability of migration in a time-dependent 
manner (Figure 5C and 5E). In addition, several genes 
previously shown to play important roles in multiple 
cancers also emerged in the identified subpathway, 
including Histidine decarboxylase (HDC), Histamine N- 
methyltransferase (HNMT), Monoamine oxidase (MAO) 
and Aldehyde dehydrogenase [NAD(P)+] (ALDH) 
(Figure 5A). Overall, these results suggest that dysfunction 
of the histamine region may be highly associated with 
metastatic prostate cancer. 



DISCUSSION 

Metabolic pathways involve genes (enzymes) and metab- 
olites, which both play important roles in the functions of 
metabolic pathways related to studying conditions such as 
diseases (10-12). The integrative analysis of genes and me- 
tabolites at the pathway structure level will, therefore, help 
to locate and evaluate key metabolic subpathways. Based 
on this idea, we integrated genes and metabolites relevant 
to a given disease into pathways, and then identified key 
metabolic subpathways via cascades among signature 
nodes within the pathway structure. The resulting 
Subpathway-GM method was applied to differential 
genes and metabolites in colorectal cancer. The results 
showed that most of the pathways identified by 
Subpathway-GM were highly associated with the initi- 
ation and progression of colorectal cancer. Interestingly, 
many pathways corresponding to significant subpathways 
identified by Subpathway-GM were obviously ignored by 
classical pathway identification methods, such as 
Pathway-G, -M and IMPaLA. We focused on seven 
pathways that were identified by Subpathway-GM but 
not by any of the other three methods. Surprisingly, six 
pathways were highly associated with colorectal cancer, 
suggesting that Subpathway-GM was able to recall more 
colorectal cancer-associated pathways. Moreover, the 
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Figure 5. Analysis of the histamine region in histidine metabolism. (A) The histamine region (path:00340_l) is located in the center of the histidine 
metabolism pathway. Zoomed region displays the subpathway in detail. (B and D) Dose-dependent effect of histamine on migration detected using 
transwell chamber assay. Cell migration ability increased as histamine concentration increased. Prostate cancer cells showed the greatest migration at 
3 umol/1. (C and E) Cells treated with 3 umol/1 histamine were incubated for different periods (0-24 h). Histamine promoted prostate cancer cell 
migration in a time-dependent manner. Each experiment aforementioned was performed in triplicate. 



results obtained for one colorectal cancer data set at the 
entire pathway and subpathway levels were reliably 
reproduced in a second colorectal cancer data set. 
Application of Subpathway-GM to a metastatic prostate 
cancer data set not only successfully identified already- 
reported metastasis-related metabolic subpathway 
regions but also predicted multiple novel potential 
metastasis-related subpathways, such as histamine metab- 
olism. These results thus demonstrate the power of a joint 
gene/metabolite subpathway strategy based on their 
topology in terms of recalling and predicting more bio- 
logically meaningful pathways. 

Compared with other entire pathway identification 
methods, Subpathway-GM was able to locate key 
subpathway regions accurately via positional information 
of differential genes and metabolites within pathways. 
This strategy can not only locate key subpathway 
regions associated with diseases but also tends to 
identify the key regions representative of entire 
pathways. The key subpathway regions identified by 
Subpathway-GM contained fewer genes and metabolites 
than entire pathways, allowing researchers to use alterna- 
tive low-throughput technologies to confirm the local 
subpathway regions related to specific diseases (46). 

This study focused on finding disease-related 
subpathways by setting the parameter n according to the 
distance between known disease genes and metabolites. 



The parameter s was also set as 5 to increase the sensitivity 
of identifying potential disease-related subpathways. 
However, users would be able to vary these parameters 
according to their needs. For example, increasing s could 
be used to focus more on the representative subpathways 
by filtering out small subpathways because smaller 
subpathway are less likely to represent the corresponding 
entire pathway. In addition, n can indirectly influence the 
size of the located subpathway. Increasing n increases the 
size of the subpathway because lenient distance similarity 
tends to merge more nodes into the same subpathway. 
Moreover, the key nodes within entire pathways also 
tend more to be added to the located subpathway 
because of high topological centrality of key nodes. 
Subpathway-GM would thus be expected to identify the 
key representative regions in entire pathways. 

Our previous 4 k-cliques' method, which was similar to 
Pathway-G but uses pathway structure, was able to 
identify metabolic subpathways based on interesting 
genes (13). However, the method only used enzyme rela- 
tions (without considering metabolites) to mine regular 
'circle' shape subpathways. On the one hand, this 
method did not support joint input from interesting 
genes and metabolites like most of pathway identification 
methods, and thus ignored joint power of genes and me- 
tabolites. On the other hand, many disease-related 
subpathways displayed relatively irregular shapes in 
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pathways, especially when considering interesting metab- 
olite information. These irregular regions can usually be 
identified effectively by the Subpathway-GM, but not by 
the 'k-cliques' method, because of the different subpath- 
way mining strategies of the two methods. We further 
compared Subpathway-GM with the 'k-cliques' method 
using data sets 1, 2 and 3, and we found that 
Subpathway-GM was more effective in identifying 
subpathways as a result of the integrative use of genes 
and metabolites, the strategy for locating subpathway 
regions and control of subpathway overlap (see 
Supplementary Text). Subpathway-GM thus detected 16, 
13 and 10 significant pathways in data sets 1, 2 and 3, 
respectively, that were ignored by the 'k-cliques' method. 

The limitations of current metabolomic technology 
mean that fewer differentially metabolites are detected 
compared with differentially expressed genes, which may 
result in pathway analysis strategies tending to ignore me- 
tabolite information. However, metabolites may be 
located in important positions in pathways, and 
Subpathway-GM takes into account the importance of 
metabolites in locating and evaluating subpathways. 
Notably, if differential metabolites appear in an important 
position in the pathways, these metabolites and their 
nearby genes can play crucial roles in identifying key 
subpathway regions through the formation of signature 
nodes. For example, the differential arachidonic acid in 
the arachidonic acid metabolism pathway effectively 
helps to locate the key subpathway region composed of 
LOX, COX and CYP according to our subpathway 
strategy. Tryptophan in tryptophan metabolism provides 
another example. These metabolites were differential in 
colorectal cancer and were thus defined as signature 
nodes to locate key subpathways in Subpathway-GM. 

In analysis of data from 'omics' technologies such as 
microarrays, Pathway-G, -M and IMPaLA belong to the 
'cut-off-based' methods, which are useful for identifying 
pathways with strong changes reflecting major changes in 
genes or/and metabolites (e.g. differential genes and me- 
tabolites) (2,8). The 'cut-off-free' methods, such as GSEA 
and its versions (e.g. MSEA), are also useful because they 
can detect molecules with weaker changes, and thus 
identify biologically meaningful but seemingly weak 
pathways (2,8). Subpathway-GM is a 'cut-off-based' 
method but can also recall biologically meaningful, seem- 
ingly weak pathways that contain key subpathway regions 
with strong changed nodes. The key principle of 
Subpathway-GM is that seemingly weak pathways from 
the entire pathway view may still show strong changes in 
key subpathway regions. Through effectively setting the 
parameter n, Subpathway-GM can thus endure some 
genes and metabolites with seemingly weak changes. 
'Cut-off-free' methods are still not applied for the integra- 
tive pathway identification of genes and metabolites. A 
fact is that acquiring integrative rank list of genes and 
metabolites is usually difficult based on different 
technologies, such as microarray, GC-MS, liquid chroma- 
tography-MS and nuclear magnetic resonance. 
Subpathway-GM thus demonstrates a significant advan- 
tage in terms of the integrative pathway identification of 
genes and metabolites. Instead of needing to know rank 



lists or differential levels, users only need to provide a list 
of the molecules of interest. Subpathway-GM thus has 
considerable potential to complement ORA and GSEA 
in pathway identification, as a result of its utilization of 
both genes and metabolites and its subpathway strategy 
based on their topologies. 
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